adder neural network
Supplementary Materials: An Empirical Study of Adder Neural Networks for Object Detection
As discussed in prior literature [1, 4], a single floating-point addition and a single floating-point multiplication cost 0.9 pJ and 3.7 pJ of energy, respectively. Meanwhile, a single 8-bit integer addition and multiplication cost 0.03 pJ and 0.2 pJ, much less than their floating-point counterparts. Therefore, it is important to explore whether adder detectors perform well under INT8 quantization. We applied INT8 post-training quantization to our Adder FCOS (B+N) model, which suffers a 0.8 mAP drop compared with the full-precision model, as shown in Table A. The energy reduction further increases from 29% to 35%. Note that post-training quantization is not optimal for INT8 models, and quantization-aware training may further improve the accuracy.
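The per-operation costs quoted above can be turned into a rough per-layer estimate. The following is a minimal sketch, not the authors' energy model: the helper `layer_energy_pj`, the layer size `n`, and the assumption that an adder layer spends two additions per filter weight (one subtraction, one accumulation) where a convolution spends one multiplication and one addition are all illustrative. It is not expected to reproduce the 29% and 35% whole-detector figures, which depend on which layers are replaced.

```python
# Per-operation energy costs in picojoules, as quoted in the text.
ENERGY_PJ = {
    ("fp", "add"): 0.9,
    ("fp", "mul"): 3.7,
    ("int8", "add"): 0.03,
    ("int8", "mul"): 0.2,
}


def layer_energy_pj(num_adds, num_muls, precision):
    """Energy of a layer (hypothetical helper) from its operation counts."""
    return (num_adds * ENERGY_PJ[(precision, "add")]
            + num_muls * ENERGY_PJ[(precision, "mul")])


# Hypothetical layer with one million weight/input pairs.
n = 1_000_000

# Assumption: convolution = n multiplications + n additions,
# adder layer = 2n additions (subtract, then accumulate).
conv_fp = layer_energy_pj(n, n, "fp")
adder_fp = layer_energy_pj(2 * n, 0, "fp")
conv_int8 = layer_energy_pj(n, n, "int8")
adder_int8 = layer_energy_pj(2 * n, 0, "int8")

print(f"FP:   conv/adder energy ratio = {conv_fp / adder_fp:.2f}")
print(f"INT8: conv/adder energy ratio = {conv_int8 / adder_int8:.2f}")
```

Under these assumptions the relative advantage of the adder layer is larger at INT8 than at floating point, which is consistent in direction with the increase in energy reduction reported above.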
Kernel Based Progressive Distillation for Adder Neural Networks
Adder Neural Networks (ANNs), which contain only additions, offer a new way of developing deep neural networks with low energy consumption. Unfortunately, there is an accuracy drop when all convolution filters are replaced by adder filters. The main reason is the optimization difficulty of ANNs using the $\ell_1$-norm, for which the gradient estimate in back-propagation is inaccurate. In this paper, we present a novel method for further improving the performance of ANNs without increasing the number of trainable parameters, via a progressive kernel based knowledge distillation (PKKD) method. A convolutional neural network (CNN) with the same architecture is simultaneously initialized and trained as a teacher network, and the features and weights of the ANN and CNN are transformed to a new space to eliminate the accuracy drop. The similarity is measured in a higher-dimensional space, using a kernel based method, to disentangle the difference between their distributions. Finally, the desired ANN is learned progressively from the information in both the ground truth and the teacher. The effectiveness of the proposed method for learning ANNs with higher performance is verified on several benchmarks. For instance, the ANN-50 trained using the proposed PKKD method obtains a 76.8\% top-1 accuracy on the ImageNet dataset, which is 0.6\% higher than that of ResNet-50.
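To make the contrast between convolution filters and adder filters concrete, here is a minimal sketch of a single filter response at one spatial position. It assumes the usual AdderNet formulation, in which the response is the negative $\ell_1$ distance between the input patch and the filter; the function names and the sample values are illustrative only.

```python
def conv_response(patch, kernel):
    """Convolution-style response: sum of element-wise products."""
    return sum(x * f for x, f in zip(patch, kernel))


def adder_response(patch, kernel):
    """Adder-style response: negative l1 distance, no multiplications."""
    return -sum(abs(x - f) for x, f in zip(patch, kernel))


# Hypothetical flattened input patch and filter weights.
patch = [1.0, -2.0, 0.5, 3.0]
kernel = [0.5, -1.0, 0.0, 2.0]

print(conv_response(patch, kernel))   # 8.5
print(adder_response(patch, kernel))  # -3.0
```

The adder response is never positive and reaches zero only when the patch equals the filter, so its output distribution differs from that of a convolution, which is the kind of mismatch the abstract says the distillation has to bridge.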
An Empirical Study of Adder Neural Networks for Object Detection
Adder neural networks (AdderNets) have shown impressive performance on image classification with only addition operations, which are more energy efficient than traditional convolutional neural networks built with multiplications. Compared with classification, there is a stronger demand for reducing the energy consumption of modern object detectors via AdderNets in real-world applications such as autonomous driving and face detection. In this paper, we present an empirical study of AdderNets for object detection. We first reveal that the batch normalization statistics in the pre-trained adder backbone should not be frozen, because of the relatively large feature variance of AdderNets. Moreover, we insert more shortcut connections in the neck part and design a new feature fusion architecture to avoid the sparse features of adder layers. We present extensive ablation studies to explore several design choices of adder detectors. Comparisons with state-of-the-art methods are conducted on the COCO and PASCAL VOC benchmarks. Specifically, the proposed Adder FCOS achieves 37.8% AP on the COCO val set, demonstrating performance comparable to that of its convolutional counterpart with about a $1.4\times$ energy reduction.
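The point about not freezing batch normalization statistics can be illustrated with a toy calculation. This is a sketch under invented numbers, not the paper's experiment: the feature values and the "frozen" mean and variance below are hypothetical, chosen only to show what happens when stored statistics no longer match large-variance features.

```python
import statistics


def batch_norm(values, mean, var, eps=1e-5):
    """Normalize values with the given mean and variance (no affine part)."""
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Hypothetical adder-layer features: negative and widely spread.
features = [-120.0, -80.0, -95.0, -60.0]

# Hypothetical frozen statistics that do not match the current features.
frozen_out = batch_norm(features, mean=0.0, var=1.0)

# Statistics recomputed from the current batch (not frozen).
batch_mean = statistics.mean(features)
batch_var = statistics.pvariance(features)
updated_out = batch_norm(features, batch_mean, batch_var)

print("frozen  variance:", statistics.pvariance(frozen_out))
print("updated variance:", statistics.pvariance(updated_out))
```

With mismatched frozen statistics the output keeps a large offset and variance, whereas recomputed statistics bring it back to roughly zero mean and unit variance.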
Supplementary Material: Progressive Kernel Based Knowledge Distillation for Adder Neural Networks
Thus, Eq. (7) in the main paper can be written as: ...
Thus, the transformation in Eq. (7) in the main paper can be expressed as a linear combination of ...
In this section, more experimental results of PKKD are presented. ... AT [5] on ResNet-20 using the CIFAR-10 dataset, as shown in Tab. 1.
Table 1: Comparison with other methods on ResNet-20 using the CIFAR-10 dataset.
Method            Accuracy
PKKD ANN          92.96%
+ dropout         92.20%
Snapshot-KD [3]   92.33%
SP-KD [2]         92.38%
Gift-KD [4]       92.22%
AT [5]            92.27%
Then, we show the superiority of the proposed method on traditional CNN distillation. The results are shown in Tab. 2.
Table 2: PKKD and KD in CNN distillation. NP' stands for using progressive or fixed teacher.
Review for NeurIPS paper: Kernel Based Progressive Distillation for Adder Neural Networks
Weaknesses: The effectiveness of the kernel method, one of the claimed contributions, is not fully justified. As shown in Table 1, the kernel operation brings an insignificant gain on CIFAR-10 with the shallower ResNet-20 network. The gains (below 0.21%) seem insignificant and may be due to the stochastic initialization of the networks, suggesting that the proposed kernel scheme may not be as effective as claimed. I advise that a comparison on ImageNet with a deeper network (e.g., ResNet-50) be performed. The current experiments are not strong enough to support the claim that the proposed method is a competitive knowledge distillation method.
Review for NeurIPS paper: Kernel Based Progressive Distillation for Adder Neural Networks
I believe that, by bridging the gap between Adder NNs and CNNs, this work provides a considerable contribution, allowing Adder NNs to be considered among practical architectures and encouraging the community to research them further. In agreement with the reviewers, I think the proposed method is thoroughly investigated empirically. Please make sure to update the paper with all the results and answers that you provided in your rebuttal.